1 Outline of the workshop

  1. Text as data: Brief overview of the theory behind text analysis
  2. The corpus
  3. Analyzing our corpora
  4. Conclusion and parting remarks

Note: The point of this workshop is not for you to leave an expert on text analysis, but rather for you to have a taste of what can be achieved (i.e. what substantive questions can be answered) when using text analysis techniques. The code provided can help you get started, but you will need to explore each method in more detail if you want to apply it to your research.

1.1 What will you need?

If you want to follow along on your computer, you should have spaCyR installed. In addition to spaCyR, you should have the following packages installed:
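Judging from the libraries loaded later in this workshop, a minimal install sketch looks like the following (the GitHub repositories for quanteda.dictionaries and ggthemr are assumptions on my part; check them before installing):

```r
# Packages loaded later in this workshop; CRAN installs first
install.packages(c("tidyverse", "spacyr", "stm", "quanteda",
                   "lattice", "ggrepel", "lme4", "remotes"))

# These two are assumed to live on GitHub (repos may have moved)
remotes::install_github("kbenoit/quanteda.dictionaries")
remotes::install_github("Mikata-Project/ggthemr")

# spaCyR needs a Python spaCy backend; this downloads one
spacyr::spacy_install()
```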

Text Analysis is a computer-intensive task. spaCyR and quanteda processes can consume a lot of your RAM, so take that into account when running your code.

1.2 What are my sources?

Much of the material/ideas for the workshop were taken from the following sources:

If you are interested in applied and/or theoretical readings on text analysis, here is a short list to get you started:

  • Grimmer and Stewart (2013) - Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts in Political Analysis
  • Lucas et al. (2015) - Computer-Assisted Text Analysis for Comparative Politics in Political Analysis
  • Blei (2012) - Probabilistic Topic Models in Communications of the ACM
  • Poetics, Volume 41, Special Issue on “Topic Models and the Cultural Sciences”
  • Slapin and Proksch (2008) - A Scaling Model for Estimating Time-Series Party Positions from Texts in AJPS
  • Welbers et al. (2017) - Text Analysis in R in Communication Methods and Measures

Let’s start!

2 Text as data: Some theory

2.1 Why text?

  1. Politicians love to speak.
  2. Bureaucrats love to write.
  3. Machirulos (macho blowhards) love to tweet.

There is text in every aspect of politics: debates on legislation, peace treaties, news reports, political manifestos, campaign speeches, social media, etc. Not only is text ubiquitous, but it is produced all the time (sometimes in real time).

2.2 Why use the help of a computer?

  1. Humans are great at understanding and analyzing the content of texts. Computers are not.
  2. Humans are also great at not being able to read thousands of documents in minutes, organize that text, classify that text, scale that text, and then produce pretty graphics from that text. Computers are not not able to do that. (I stand by the double negative.)

We will be learning about the latter.

Note: Text analysis is not a field, it is a tool. Think about it as a regression where the data are words instead of numbers. Thus, the fanciest of text analysis techniques is no good without a well thought out, substantive, theoretically motivated question.

2.3 Four principles of automated text analysis (from Grimmer and Stewart (2013))

  1. All Quantitative Models of Language Are Wrong—But Some Are Useful.
  • Data generation process for text is unknown.
  • Language is too complex for computers to correctly decipher (e.g. “Time flies like an arrow. Fruit flies like a banana.”).
  • Since language is so context-specific, more complex models are rarely more useful for the analysis of texts.
  2. Quantitative Methods Augment Humans, Not Replace Them
  • You need to read the text, know the text, be the text.
  • Computers organize, direct, and suggest.
  • Humans read and interpret.
  3. There Is No Globally Best Method for Automated Text Analysis
  • Different needs require different methods.
  • Even when you have the same needs, the same model might not fit the data.
  4. Validate, Validate, Validate
  • Outputs can be misleading (or simply wrong).
  • It is incumbent upon you, the researcher, to validate the use of automated text analysis.

Now we can start with some (sample) text analysis. Note that we will be doing basic analysis (as generalizable as it gets) of text, and then use some models as examples. To see the extent of what can be done with different models, I suggest you read Grimmer and Stewart (2013).

3 The corpus (in theory)

To analyze text, we first need a corpus, a large and structured set of texts for analysis. The units (texts) of the “structured set of texts” can be anything you want them to be: a complete speech, each paragraph of the speech, or each sentence within that paragraph. The relevance of the unit will depend on your research question.

3.1 Text as data: Some specifics

Text data is a sequence of characters grouped into documents. The set of documents is the corpus.

  • Text data is unstructured
  • There is information we want, mixed in with (A LOT of) information we do not want.
  • We need to separate the wheat from the chaff.

Note: All text analysis methods will produce some information. The art lies in the ability of the researcher to figure out what is valuable and what is not.

3.2 What counts as a document?

The unit of analysis when using text as data will depend on the question you are asking. For example:

  • If you are looking at how politicians react to different types of economic crises, then each document produced after a crisis would be the unit of analysis.
  • If you are looking at how politicians differ within a campaign, then you might aggregate all the texts produced by a candidate during a campaign as the unit of analysis.
  • If you are looking at how politicians address different topics within a campaign, then your unit of analysis might be a section or paragraph of a manifesto.

3.3 Where can I get my hands on some juicy corpora?

  1. Chris Bail curates a list of corpora already compiled and ready to use.
  2. Governments produce text ALL THE TIME. It is, usually, publicly available and, depending on the country/organization, easily accessible.
  3. Scrape the web. Websites can make it difficult to scrape data, whether through restrictive terms of use, bot-blockers or, worse, JavaScript-rendered content. There are creative ways to get around these. (If you are interested in web-scraping, let me know and I can help you get started with some code.)

3.4 Some final words on documents

Original corpora are rarely ready to be analyzed without some previous cleaning. You often want to get rid of hyphenations at line breaks, tables of contents, indexes, etc. All of these are corpus-specific and require attention ahead of time.

  • Learn how to use regular expressions (regex). The stringr package in R or the Python package re are useful tools.
  • While useful, regex is tedious to learn. Check out Sanchez 2013 for a good guide.
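As a small taste of regex, here is a base-R sketch (the text is a made-up example) that rejoins words hyphenated across line breaks, one of the cleaning tasks mentioned above:

```r
# Hypothetical OCR-style text with words hyphenated across line breaks
txt <- "Text analy-\nsis requires clean in-\nput."

# Remove a hyphen immediately followed by a newline, rejoining the word
clean <- gsub("-\n", "", txt)
clean
## "Text analysis requires clean input."
```

The same pattern works with stringr via str_replace_all(txt, "-\n", "").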

Some data are only available as non-searchable PDFs or images. These need to be converted to text before R (or Python) can read them. I use ABBYY FineReader, which is ‘expensive’ but might be available at your (old/next) university library. Joe Sutherland maintains an open-source OCR (Optical Character Recognition) tool.

Finally, I advise against using spell checkers. Most corpora use specialized language that would be flagged by standard spell-checkers (and I have not found one that can be automated to check text in Spanish). In most empirical contexts, we can safely assume that spelling errors (especially OCR errors) are uncorrelated with treatment assignment.

(Ask me about some of the other strengths and weaknesses of using data from social media.)

4 The corpus (in practice)

We need texts for text analysis. Luckily, we have a dataset of 5640 tweets from 3885 unique users containing the word ‘capitalism’. The dataset has been tinkered with and cleaned and is ready to be processed.

Before we start, let’s load the required packages. Remember that spaCyR is an R wrapper around the Python spaCy library, so it needs to be initialized first.

rm(list=ls(all=TRUE)) # clear the workspace
library(tidyverse)
library(spacyr)
library(stm)
library(quanteda)
library(quanteda.dictionaries)
library(dplyr)
library(lattice)
library(ggplot2)
library(ggrepel)
library(lme4)
library(ggthemr)
ggthemr("fresh")
spacy_initialize() # start the Python spaCy backend

Let’s load our dataset and see how it looks.

load("data_capitalism.Rdata")
data_capitalism <- data_capitalism[!data_capitalism$text_clean=="",]
data_capitalism %>% glimpse()
## Observations: 5,627
## Variables: 13
## $ text           <chr> "if you're having trouble understanding capitalism jus…
## $ friendsRT      <int> 637, 576, 674, 116, 2937, 7721, 11768, 3011, 2838, 105…
## $ followersRT    <int> 1126, 3407, 36590, 5080, 6088, 12074, 15690, 3277, 153…
## $ timeRT         <dbl> 18195, 18228, 18223, 18230, 18226, 18222, 18223, 18229…
## $ nameauth       <chr> "shnupz", "aniceburrito", "CaseyExplosion", "HairyBoom…
## $ likeRT         <int> 45195, 4349, 2850, 57, 78, 183, 34, 110, 3750, 133, 37…
## $ retweetRT      <int> 7258, 590, 1031, 5, 16, 41, 16, 44, 1113, 55, 12, 90, …
## $ to_membership  <dbl> 3, 3, 3, 3, 6, 6, 6, 3, 6, 6, 6, 6, 6, 6, 6, 6, 9, 9, …
## $ to_ind         <dbl> 82, 115, 238, 3, 11, 17, 12, 280, 282, 16, 8, 98, 98, …
## $ mem_name       <chr> "POC Anti-Capitalism", "POC Anti-Capitalism", "POC Ant…
## $ mem_name_dummy <chr> "Anti-Capitalism", "Anti-Capitalism", "Anti-Capitalism…
## $ text_clean     <chr> "if you're having trouble understanding capitalism jus…
## $ date_created   <date> 2019-10-26, 2019-11-28, 2019-11-23, 2019-11-30, 2019-…

To analyze the text found in the dataset, we need to create a corpus object. The corpus() function (from the quanteda package) does just this. Its main argument has to be a character vector. (The readtext() function from the readtext package can read text-formatted files.)

cap_corp <- corpus(data_capitalism$text_clean, 
                       docvars = data.frame(author = data_capitalism$nameauth,
                                            time = data_capitalism$timeRT,
                                            in_degree = data_capitalism$to_ind,
                                            followers = data_capitalism$followersRT,
                                            RT_count = data_capitalism$retweetRT,
                                            left_right = data_capitalism$mem_name,
                                            left_right_dum = data_capitalism$mem_name_dummy),
                       metacorpus = list(source = "Twitter",
                                         notes = "Scraped on Nov. 30 - Dec. 3, 2019"))

summary(cap_corp,10) 

A corpus object works similarly to a normal dataset, in that each document (in our case, each tweet) is an observation that has additional covariates (i.e. docvars) that describe it.

cap_corp[c(1:5)]
##                                                                                                                                                                                                                                                                                                text1 
##                                                                                                                                                                 "if you're having trouble understanding capitalism just remember there are only 4 websites you use in 2019 and you hate all of them" 
##                                                                                                                                                                                                                                                                                                text2 
##                                                                                                                                                                                                                     "\"capitalism not cronyism\" poster really pushes this over the top. Beautiful " 
##                                                                                                                                                                                                                                                                                                text3 
##                             "\"Fascism is capitalism in decay.\"\n\nI didn't even understand what that meant a few years back, but the wealthy funnelling money to white nationalists as \"charity\" so they can avoid paying tax, that's about as on the nose an example as I could've hoped for. " 
##                                                                                                                                                                                                                                                                                                text4 
##                                                                                                                                                                                                                                                            "What if video games, but too capitalism" 
##                                                                                                                                                                                                                                                                                                text5 
## "@davidsirota No stage - it's just capitalism. Stages aren't distinguishable with capitalism. Only its roots: power, greed and materialism; its methods: theft, exploitation and oppression; and, its outcomes: plutocracy, inequality and ecological destruction. Capitalism itself is the crisis."

You can explore a corpus as you would explore other lists (with brackets). To get all the texts you can use texts(cap_corp), but you don’t want to do that. We can also subset a corpus. Here are all the tweets from Twitter users with more than 90K followers (and fewer than 200 types… I know, useless, but you might appreciate the code to do it):

cap_subset <- corpus_subset(cap_corp, 
                                followers > 90000 & ntype(cap_corp) < 200)
summary(cap_subset, 15)

We might be interested in sentences rather than complete tweets.

cap_sentences <- corpus_reshape(cap_corp, to = "sentences") # or "paragraphs"
summary(cap_sentences,8)
cap_sentences[1:5]
##                                                                                                                                                                                                                          text1.1 
##                                                                                             "if you're having trouble understanding capitalism just remember there are only 4 websites you use in 2019 and you hate all of them" 
##                                                                                                                                                                                                                          text2.1 
##                                                                                                                                                            "\"capitalism not cronyism\" poster really pushes this over the top." 
##                                                                                                                                                                                                                          text2.2 
##                                                                                                                                                                                                                      "Beautiful" 
##                                                                                                                                                                                                                          text3.1 
##                                                                                                                                                                                            "\"Fascism is capitalism in decay.\"" 
##                                                                                                                                                                                                                          text3.2 
## "I didn't even understand what that meant a few years back, but the wealthy funnelling money to white nationalists as \"charity\" so they can avoid paying tax, that's about as on the nose an example as I could've hoped for."

Since these are tweets and people often tweet one sentence at a time, this conversion might be moot for our corpus. But in longer texts narrowing down the unit of analysis can be helpful, especially if we are trying to estimate topic models (more on this later).

The reshape function can divide texts into “sentences” and “paragraphs”. corpus_reshape() uses punctuation and line breaks (e.g. “.”, “\n”) to determine where to cut.

5 Pre-processing the corpus

As previously mentioned, our corpora have the information we want, and a lot of information we do not. Uninformative data add noise and reduce the precision of resulting estimates (and are computationally costly). We aim to have a “bag-of-words”, or to convert a corpus D to a matrix X. In the “bag-of-words” representation, a row of X is just the frequency distribution over words in the document corresponding to that row.
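To make the idea concrete, here is a base-R sketch (toy documents, no packages) of turning a tiny corpus D into a matrix X:

```r
# Two toy documents
docs <- c("capitalism is broken", "capitalism is not broken")

# Tokenize: lower-case and split on spaces
toks <- strsplit(tolower(docs), " ")

# The vocabulary is the set of unique words across the corpus
vocab <- sort(unique(unlist(toks)))

# Each row is a document's frequency distribution over the vocabulary
X <- t(sapply(toks, function(tk) table(factor(tk, levels = vocab))))
rownames(X) <- c("doc1", "doc2")
# doc1: broken=1, capitalism=1, is=1, not=0
# doc2: broken=1, capitalism=1, is=1, not=1
```

Word order is discarded entirely; only the counts survive, which is exactly what “bag-of-words” means.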

Before we do that, we will get rid of all the unwanted information. First, we turn all words to lower-case and get rid of all punctuation and numbers. The tokens() command will do this and separate all the texts into tokens.

cap_toks <- tokens(cap_corp,
                   remove_numbers = TRUE,
                   remove_punct = TRUE,
                   remove_separators = TRUE,
                   remove_twitter = TRUE,
                   remove_hyphens = TRUE,
                   remove_url = TRUE)

“Token” is a fancy name for an instance of a word (ntoken() counts them). Tokens contain all the information we need to run our models. Yet, there is still a lot of noise.

cap_coll <- textstat_collocations(cap_toks)
head(cap_coll, 20)

(Quick detour: Collocations bundle together a set number of words (also known as n-grams) that appear next to each other. The default is 2, but we can set any length we want.)

Some words are “useless”. Let’s get rid of the stopwords. (For a take on when stopwords are informative, check Pennebaker (2011).)

cap_toks_stop <- tokens_remove(cap_toks,
                               stopwords(language = "en"),
                               padding = FALSE)
cap_toks_stop <- tokens_remove(cap_toks_stop, "amp")        # HTML artifact ("&amp;")
cap_toks_stop <- tokens_remove(cap_toks_stop, "capitalism") # the query term itself
cap_coll <- textstat_collocations(cap_toks_stop)
head(cap_coll, 20)

Finally, we might want to stem our tokens.

Example of stemming

cap_toks_stem <- tokens_wordstem(cap_toks_stop)
cap_coll <- textstat_collocations(cap_toks_stem)
head(cap_coll, 20)

5.1 Before the “bag-of-words”

We have just pre-processed the data, but the number of documents (e.g. tweets) and the (pre-processed) length of these documents already provide an interesting set of variables for analysis.

For example:

  • How do major capitalist events affect the production of tweets from capitalists and anti-capitalists?
  • What is the relation between in-degree and the effort and quality of tweets?

# How do major capitalist events affect the production of tweets from capitalists and anti-capitalists?

data_capitalism <- data_capitalism %>% 
  group_by(date_created,mem_name_dummy) %>%
   mutate(count_tweets = n())

data_capitalism_res <- data_capitalism[!duplicated(data_capitalism[c(11,13)]),]
data_capitalism_res <- data_capitalism_res[data_capitalism_res$date_created > "2019-11-19",]

ggplot(data_capitalism_res, aes(x=date_created , y = count_tweets , color = mem_name_dummy)) +
         stat_smooth() +
    scale_color_discrete(name = "") +
  labs(x = "Date", y = "Count of Tweets") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.position="bottom") +
  geom_vline(xintercept = as.Date("2019-11-29"), color = "black", linetype = "dashed") 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# What is the relation between in-degree and effort of tweets? 

data_capitalism$length_tweet <- ntoken(char_tolower(data_capitalism$text_clean),
                                       remove_punct = TRUE)

data_capitalism$length_type <- ntype(char_tolower(data_capitalism$text_clean),
                                     remove_punct = TRUE) # repeated words counted once

ggplot(data_capitalism, aes(x=to_ind , y =  length_tweet)) +
         geom_point() +
     geom_smooth(method = "lm", se = T)+
  labs(x = "In-Degree", y = "Length of Tweets")  

ggplot(data_capitalism, aes(x=to_ind , y =  length_type)) +
         geom_point() +
     geom_smooth(method = "lm", se = T)+
  labs(x = "In-Degree", y = "Number of Types per Tweets")  

6 Analyzing our corpus: Keywords in context and collocations

Finally, we are ready to analyze our corpus. We will start with the basics: keywords in context.

6.1 Keywords in context

Say we are interested in the way capitalists and anti-capitalists describe capitalism. Let’s see how they talk about it using the “keyword-in-context” function:

cap_talk <- kwic(cap_toks_stop, "capital*", window=10) # 10 words before and after the word in question
head(cap_talk, 20)

If we put together all these words in one whole text, we can see how they are related in that context.

cap_talk_context <- paste(cap_talk$pre, cap_talk$post, collapse = " ")
cap_talk_coll <- textstat_collocations(cap_talk_context, size = 3)
head(cap_talk_coll, 20)

As a reference, the \(\lambda\) score measures how often a specific sequence of \(K\) consecutive tokens occurs (in our second example, \(K=3\)), relative to all possible sequences of \(K\) consecutive tokens.

7 Analyzing a corpus: Features

The features of a corpus can give us clues about the characteristics of our text. At this point, we are already treating our documents as “bags-of-words”. Having tokenized our corpus means that we are interested in each word, how often it appears, in conjunction with what other words, etc. We are going to create a Document-Feature Matrix (dfm) object. I will not go into detail about what a dfm is. For now, just think of it as a dataset where each row is the count of words in each document.

dfm_toks <- dfm(cap_toks_stop)

Note that my dfm object contains all the document level variables I added to my corpus at the beginning.

7.1 Most frequently used words

Simply:

cap_freq <- textstat_frequency(dfm_toks, n = 5, groups = "left_right")
head(cap_freq, 15)

Or a plot of the most frequently used words by group:

dfm_toks %>% 
  textstat_frequency(n = 10,groups = "left_right") %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency, color = group)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")

7.2 Lexical diversity

We might be interested in the breadth and variety of vocabulary used in a document. Lexical diversity is widely believed to be an important parameter to rate a document in terms of textual richness and effectiveness. Knowing Marxists, we might expect them to use more complex language than capitalists.

dfm_toks_2 <- dfm(cap_toks)

cap_lexdiv <- textstat_lexdiv(dfm_toks_2,measure = "TTR")
tail(cap_lexdiv, 5) # Does not tell us much, but...
to_plot <- cbind.data.frame(data_capitalism, cap_lexdiv)

ggplot(to_plot, aes(x= factor(mem_name), y=TTR)) +
  geom_boxplot() +
    labs(x = "Type")  

There are many measures of lexical diversity, each measuring something slightly different. textstat_lexdiv() includes many and you can check which adapts best to your needs here.
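The TTR used above is simply the number of types divided by the number of tokens. A base-R sketch with a toy token vector:

```r
# Type-token ratio (TTR): unique words (types) / total words (tokens)
tweet <- c("capitalism", "is", "capitalism", "in", "crisis")
ttr <- length(unique(tweet)) / length(tweet)
ttr # 4 types / 5 tokens = 0.8
```

Note that TTR is sensitive to document length, which is one reason textstat_lexdiv() offers alternative measures.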

7.3 TF-IDF

Looking at simple frequencies might hide some important document features. As in everything in the social sciences, we can always complicate it a bit more. Enter TF-IDF: “Term-frequency / Inverse-document-frequency”. TF-IDF weighting up-weights relatively rare words that do not appear in all documents. Using term frequency and inverse document frequency allows us to find words that are characteristic for one document within a collection of documents.

head(dfm_toks[, 5:15])
## Document-feature matrix of: 6 documents, 11 features (83.3% sparse).
## 6 x 11 sparse Matrix of class "dfm"
##        features
## docs    websites use hate cronyism poster really pushes top beautiful fascism
##   text1        1   1    1        0      0      0      0   0         0       0
##   text2        0   0    0        1      1      1      1   1         1       0
##   text3        0   0    0        0      0      0      0   0         0       1
##   text4        0   0    0        0      0      0      0   0         0       0
##   text5        0   0    0        0      0      0      0   0         0       0
##   text6        0   0    0        0      0      0      0   0         0       0
##        features
## docs    decay
##   text1     0
##   text2     0
##   text3     1
##   text4     0
##   text5     0
##   text6     0
head(dfm_tfidf(dfm_toks)[, 5:15])
## Document-feature matrix of: 6 documents, 11 features (83.3% sparse).
## 6 x 11 sparse Matrix of class "dfm"
##        features
## docs    websites      use     hate cronyism   poster   really   pushes      top
##   text1 3.449247 1.869463 1.701059 0        0        0        0        0       
##   text2 0        0        0        2.231763 2.495004 1.568433 3.273156 2.258915
##   text3 0        0        0        0        0        0        0        0       
##   text4 0        0        0        0        0        0        0        0       
##   text5 0        0        0        0        0        0        0        0       
##   text6 0        0        0        0        0        0        0        0       
##        features
## docs    beautiful  fascism    decay
##   text1  0        0        0       
##   text2  2.847187 0        0       
##   text3  0        1.858182 2.708884
##   text4  0        0        0       
##   text5  0        0        0       
##   text6  0        0        0

If we are building a dictionary, for example, we might want to include words with high TF-IDF values. Another way to think about TF-IDF is in terms of predictive power. Words that are common to all documents do not have any predictive power and receive a TF-IDF value of 0. Words that appear in only relatively few documents have greater predictive power and receive a TF-IDF > 0.
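To see the mechanics, here is a base-R sketch of the default quanteda weighting (count × log10(N/df)), using a hypothetical two-document count matrix:

```r
# Toy document-feature matrix: 2 documents, 2 features (hypothetical counts)
X <- matrix(c(1, 1,    # "capitalism" appears in both documents
              2, 0),   # "cronyism" appears only in the first
            nrow = 2,
            dimnames = list(c("d1", "d2"), c("capitalism", "cronyism")))

N     <- nrow(X)        # number of documents
df    <- colSums(X > 0) # document frequency of each feature
tfidf <- sweep(X, 2, log10(N / df), `*`)

# "capitalism" is in every document: idf = log10(2/2) = 0, so its weight is 0
# "cronyism" is in one document:     idf = log10(2/1), so d1 gets 2 * log10(2)
```

This is why the common words in the dfm output above collapse to 0 after dfm_tfidf(), while rarer, document-specific words get positive weights.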

7.4 Wordclouds

Wordclouds are silly, but people seem to love them. Begrudgingly, I include the code:

# In order to group the wordcloud I will create a new dfm with only one group:
dfm_toks_wordcloud <- dfm(cap_toks_stop, 
                          groups = "left_right")

# comparison = T divides the word cloud into groups:
textplot_wordcloud(dfm_toks_wordcloud, comparison = T, max_words = 300)

8 Analyzing a corpus: Dictionary-based approaches

Dictionaries help us connect qualitative (concepts) and quantitative information extracted from text. Constructing a dictionary requires contextual interpretations. The key in a dictionary for text analysis (more of a thesaurus) is associated with non-exclusive terms (values):

Formally, there are three major categories:

You can create your dictionary (using the dictionary() function), or you can use well-known dictionaries like: General Inquirer (Stone et al. 1996), Regressive Imagery Dictionary (Martindale, 1975, 1990), Linguistic Inquiry and Word Count, Laver and Garry (2000) to distinguish policy domains, Lexicoder Sentiment Dictionary (Young and Soroka, 2012). All dictionaries have drawbacks and may or may not adequately capture what you want them to capture. Remember: validate, validate, validate.

8.1 Corpus-specific dictionary

Let’s apply the dictionary used by Pearson and Dancey (2011) to our corpus and see if there is any gender element to the way each side addresses capitalism.

dict_women <- dictionary(list(mujeres = c("woman", "women", "girl*", "female*")))

cap_women <- liwcalike(data_capitalism$text_clean, 
                               dictionary = dict_women)

to_plot <-  cbind.data.frame(cap_women,data_capitalism)
to_plot$women_dum <- 0 
to_plot$women_dum[to_plot$mujeres>0] <- 1

ggplot(to_plot, aes(x=mem_name, y = mujeres)) +
    geom_boxplot() +
    labs(x = "Type") 

ggplot(to_plot, aes(x=mem_name, y = women_dum)) +
    geom_bar(stat = "identity") +
    labs(x = "Type", y = "Times Women are Mentioned" ) 

“The patriarchy is effing vast…”

8.2 LIWC and Sentiment Analysis

Let’s do some simple “sentiment analysis” using two dictionaries: the General Inquirer (GI) and the NRC Word-Emotion Association Lexicon (NRC).

cap_sentimentNRC <- liwcalike(cap_corp, 
                               dictionary = data_dictionary_NRC)

cap_sentimentGI <- liwcalike(cap_corp, 
                               dictionary = data_dictionary_geninqposneg)

head(cap_sentimentNRC, 15)
head(cap_sentimentGI, 15)

Both produce data frame objects that can later be manipulated for more in-depth analysis. Let’s see how “anger” and “joy” language varies across groups and time using the NRC dictionary.

cap_sentiment_df <- cbind.data.frame(cap_sentimentNRC,data_capitalism)
cap_sentiment_df <- cap_sentiment_df[cap_sentiment_df$date_created > "2019-11-19",]

ggplot(cap_sentiment_df, aes(x=date_created, y=anger, color = mem_name))+
  stat_smooth() +
  labs(title="Anger in Capitalism", x="Date", y = "Anger", las=2)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(cap_sentiment_df, aes(x=date_created, y=joy, color = mem_name))+
  stat_smooth() +
  labs(title="Joy in Capitalism", x="Date", y = "Joy", las=2)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

One of the limitations of sentiment analysis is the variation we get on similar sentiments when using different dictionaries. For example, analyzing polarity in our data (positive language minus negative language), we obtain different results from GI and NRC.

polarity_NRC <- cap_sentimentNRC$positive - 
cap_sentimentNRC$negative

polarity_GI <- cap_sentimentGI$positive - 
cap_sentimentGI$negative

cor(polarity_NRC,polarity_GI)
## [1] 0.5070789

As a general recommendation, it is always good practice to explore a dictionary before actually using it. This is particularly important when working with text in another language: most dictionaries (commercial or otherwise) have been developed and validated in English. If you are using another language, it is a good idea to check how words are classified. Another common practice is to run the Google Translate API or Microsoft Translator API on the original corpus and analyze the translation. This can be costly depending on the size of the corpus, and the results, like those of any other text-analysis tool, need to be constantly validated (Lucas et al. 2015).

9 Analyzing a corpus: Topic models and document distance

Topic models were primarily developed to summarize unstructured text, to use words within documents to infer their topics, and to serve as a form of dimension reduction. They allow social scientists to use topics as a form of measurement, since we are often interested in how observed covariates drive trends in language.

Topic models are “unsupervised” methods. As such, they require a great deal of human supervision, especially when it comes to validating the results.

9.1 Document-term Matrix

Topic models are a broad class of Bayesian generative models that encode problem-specific structure into an estimation of categories (Grimmer and Stewart 2013; Blei et al. 2010). Statistically, a topic is a probability mass function over words. The idea of topic models is that each document exhibits a topic in some proportion: each document is a distribution over topics, and each topic is a distribution over words.

We can take our corpus and turn it into a matrix that reflects this concept. We call it a document-term matrix (dtm).

Matrix D

  • A corpus of \(n\) documents \(D_1, D_2, D_3, \ldots, D_n\)
  • A vocabulary of \(m\) words \(W_1, W_2, W_3, \ldots, W_m\)
  • The value of the \((i,j)\) cell gives the frequency count of word \(W_j\) in document \(D_i\)
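As a minimal sketch, we can build a toy dtm with quanteda's dfm() function (the one used throughout this workshop); the three short "documents" here are made up for illustration:

```r
library(quanteda)

toy <- c(d1 = "capitalism creates markets",
         d2 = "markets create markets",
         d3 = "capitalism capitalism")
toy_dfm <- dfm(tokens(toy))
toy_dfm  # rows are documents D_i, columns are words W_j, cells are frequency counts
```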

Latent Dirichlet Allocation (LDA) decomposes the dtm into two lower-dimensional matrices, \(M_1\) and \(M_2\):

Matrix M1


Matrix M2


  • \(M_1\) is an \(N \times K\) document-topic matrix
  • \(M_2\) is a \(K \times M\) topic-term matrix

LDA estimates the distribution over words for each topic, and the proportion of a topic in each document. You can read about the math behind the two-step process in Grimmer and Stewart (2013).

Plate notation of Latent Dirichlet Allocation


  • \(\alpha\) –> document-topic density: higher \(\alpha\) means documents contain more topics; lower \(\alpha\) means documents contain fewer topics.
  • \(\beta\) –> topic-word density: higher \(\beta\) means topics contain more words; lower \(\beta\) means topics contain fewer words.
  • Number of topics –> this is specified in advance, or can be chosen to optimize model fit (we will get back to this point). The “statistically optimal” topic count is usually too high for the topics to be interpretable or useful.

9.2 Structural Topic Models (STM)

To apply (and visualize) LDA, we are going to be using an extension of LDA known as Structural Topic Models (STM).

\[ STM = LDA + Metadata \]

STM provides two ways to include contextual information to “guide” the estimation of the model. First, topic prevalence can vary by metadata (e.g. Republicans talk about military issues more than Democrats). Second, topic content can vary by metadata (e.g. Republicans talk about military issues differently from Democrats).

We can run STM using the stm package. The stm package covers the complete workflow (i.e. from raw text to figures), and if you are planning to use it in the future I highly encourage you to check this and this. stm() takes our dfm and produces topics. If we do not specify any prevalence terms, it will estimate a plain LDA. Since this is a Bayesian approach, it is recommended that you set a seed value for replication. We also need to set the number of topics \(K\). How many topics is the right number? There is no single correct answer. With too many pre-specified topics, the categories might be meaningless; with too few, you might be lumping together two or more topics. Note that changes to a) the number of topics, b) the prevalence term, c) the omitted words, or d) the seed value can (greatly) change the outcome. This is where validation becomes crucial (for a review, see Wilkerson and Casas 2017).

Using our dataset, I will use stm to estimate the topics surrounding “capitalism” on Twitter. As my prevalence term, I add the position of each authority. I set my number of topics at 10 (but with a corpus this big I should probably set it at ~30 and work my way up from there).

dfm_toks_stem <- dfm(cap_toks_stem)

# After trimming, some documents were left with no tokens so I will eliminate those before running my model:
dfm_toks_sub <- dfm_subset(dfm_toks_stem, ntoken(dfm_toks_stem) > 0)

cap_TM <- stm(dfm_toks_sub, K = 10, seed = 1984,
              prevalence = ~left_right,
              init.type = "Spectral")

The nice thing about the stm() function is that it allows us to see in “real time” what is going on inside the black box. We can summarize the process in the following way (this is similar to collapsed Gibbs sampling, which the stm() estimation loosely resembles):

  1. Go through each document, and randomly assign each word in the document to one of the \(K\) topics.

  2. Notice that this random assignment already gives both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).

  3. So to improve on them, for each document \(\displaystyle W\):

     3.1 Go through each word \(\displaystyle w\) in \(\displaystyle W\).

     3.1.1 For each topic \(\displaystyle t\), compute two things:

     3.1.1.1 \(\displaystyle p(t|W)\) = the proportion of words in document \(\displaystyle W\) that are currently assigned to topic \(\displaystyle t\), and

     3.1.1.2 \(\displaystyle p(w|t)\) = the proportion of assignments to topic \(\displaystyle t\), over all documents, that come from this word \(\displaystyle w\).

     3.1.2 Reassign \(\displaystyle w\) a new topic, choosing topic \(\displaystyle t\) with probability \(\displaystyle p(t|W)*p(w|t)\). According to our generative model, this is essentially the probability that topic \(\displaystyle t\) generated word \(\displaystyle w\), so it makes sense to resample the current word’s topic with this probability. (This glosses over a couple of things, in particular the use of priors/pseudocounts in these probabilities.)

     In other words, in this step we assume that all topic assignments except for the one for the current word are correct, and then update the assignment of the current word using our model of how documents are generated.

  4. After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated with each topic (by counting the proportion of words assigned to each topic overall).
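To make these steps concrete, here is a toy collapsed Gibbs sampler in base R. This is a pedagogical sketch on a made-up 3×3 dtm, with hypothetical priors alpha and beta standing in for the pseudocounts glossed over above; stm itself relies on a faster variational-EM approach rather than this sampler:

```r
set.seed(1984)
dtm <- matrix(c(2, 0, 1,
                0, 3, 1,
                1, 1, 2), nrow = 3, byrow = TRUE)  # toy document-term matrix
D <- nrow(dtm); V <- ncol(dtm); K <- 2
alpha <- 0.1; beta <- 0.1                          # Dirichlet priors (pseudocounts)

# Step 1: unroll the dtm into tokens and give each token a random topic
tokens <- do.call(rbind, lapply(seq_len(D), function(d)
  data.frame(doc = d, word = rep(seq_len(V), dtm[d, ]))))
z <- sample(K, nrow(tokens), replace = TRUE)

# Step 2: the count matrices implied by the random assignment
ndk <- unclass(table(factor(tokens$doc, 1:D), factor(z, 1:K)))   # doc-topic counts
nkw <- unclass(table(factor(z, 1:K), factor(tokens$word, 1:V)))  # topic-word counts

# Steps 3-4: repeatedly resample each token's topic
for (iter in 1:200) {
  for (i in seq_len(nrow(tokens))) {
    d <- tokens$doc[i]; w <- tokens$word[i]; k <- z[i]
    ndk[d, k] <- ndk[d, k] - 1                       # remove the current assignment
    nkw[k, w] <- nkw[k, w] - 1
    p_t_W <- ndk[d, ] + alpha                        # proportional to p(t|W)
    p_w_t <- (nkw[, w] + beta) / (rowSums(nkw) + V * beta)  # p(w|t)
    k <- sample(K, 1, prob = p_t_W * p_w_t)          # resample with prob p(t|W) * p(w|t)
    z[i] <- k
    ndk[d, k] <- ndk[d, k] + 1
    nkw[k, w] <- nkw[k, w] + 1
  }
}

round(ndk / rowSums(ndk), 2)  # estimated topic mixture of each document
```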

(This explanation was taken from here). Let’s explore the topics produced:

labelTopics(cap_TM)
## Topic 1 Top Words:
##       Highest Prob: peopl, like, good, end, never, look, realli 
##       FREX: like, radic, gift, art, vegan, digit, empire 
##       Lift: 80s, aral, chichisup, coffin, dutch, etcetera, feet 
##       Score: like, peopl, gift, look, good, diana, princess 
## Topic 2 Top Words:
##       Highest Prob: social, markruffalo, rich, million, kill, mark, socialist 
##       FREX: markruffalo, million, mark, ruffalo, worth, millionair, hollywood 
##       Lift: deepstat, 30m, breitbartnew, bridgetphetasi, clue, cronyism, darkwatersmovi 
##       Score: markruffalo, mark, ruffalo, million, kill, worth, social 
## Topic 3 Top Words:
##       Highest Prob: just, say, fuck, know, problem, much, american 
##       FREX: fuck, black, love, leftist, friday, u, sex 
##       Lift: asshol, bent, blk, bourgeois, chart, custom, dick 
##       Score: fuck, just, hate, black, friday, love, problem 
## Topic 4 Top Words:
##       Highest Prob: social, market, free, state, anti, climat, chang 
##       FREX: climat, global, differ, u.s, competit, challeng, threat 
##       Lift: 🤷🏽‍♀, 150th, 17th, 1990s, 1a2a, 4human, 4sale 
##       Score: market, state, climat, free, social, global, stage 
## Topic 5 Top Words:
##       Highest Prob: get, money, go, now, give, everyon, busi 
##       FREX: money, ethic, pretend, speak, disgust, get, sleep 
##       Lift: agreed, artworkte, barackobama, ben, bigot, blatant, brianstelt 
##       Score: get, money, go, give, ethic, everyon, bad 
## Topic 6 Top Words:
##       Highest Prob: system, need, human, world, profit, live, exploit 
##       FREX: imperi, resourc, individu, relationship, growth, longer, altern 
##       Lift: agricultur, anarchist_black, arab, autonomi, bake, bolivian, cathol 
##       Score: system, human, profit, resourc, exploit, product, natur 
## Topic 7 Top Words:
##       Highest Prob: work, can, us, worker, time, class, better 
##       FREX: ewarren, debt, labour, insur, hour, corrupt, care 
##       Lift: catastrophe, davi, ewarren, insecur, mitig, patient, unafford 
##       Score: us, work, can, corrupt, pay, warren, poor 
## Topic 8 Top Words:
##       Highest Prob: freedom, berni, public, polici, protest, capit, failur 
##       FREX: berni, polici, protest, failur, sander, sensand, leecamp 
##       Lift: pete, _michelangelo__, (ு, 🧟‍♀, 🧟‍♂, 11warrior, 13_moth 
##       Score: zerogbadillion, jimmy_dor, ajc4other, alllibertynew, blysx, cptseamonkey, decakarjeffrey 
## Topic 9 Top Words:
##       Highest Prob: make, think, right, even, labor, part, instead 
##       FREX: instead, fuel, imposs, consequ, content, vision, cheap 
##       Lift: 💁‍♀, 💁‍♂, 🙍‍♀, 🙍‍♂, analyz, beckert, behaviour 
##       Score: make, think, labor, even, right, instead, el 
## Topic 10 Top Words:
##       Highest Prob: thing, way, also, creat, power, societi, great 
##       FREX: thing, seem, issu, anandwrit, hierarchi, essay, racism 
##       Lift: acceleration, ambit, assert, blackston, calebmaupin, copyright, demis 
##       Score: thing, also, creat, racism, societi, way, power

FREX weights words by their overall frequency and how exclusive they are to the topic. Lift weights words by dividing by their frequency in other topics, therefore giving higher weight to words that appear less frequently in other topics. Similar to lift, score divides the log frequency of the word in the topic by the log frequency of the word in other topics (Roberts et al. 2013). Bischof and Airoldi (2012) show the value of using FREX over the other measures.

You can use the plot() function to show the topics.

plot(cap_TM, type = "summary", labeltype = "frex") # or "prob", "lift", "score"

If you want to see a sample of a specific topic:

findThoughts(cap_TM, texts = texts(cap_corp)[docnames(dfm_toks_sub)], topics = 4)  
## 
##  Topic 4: 
##       "Neoliberalism is the 20th-century resurgence of 19th-century ideas associated with laissez-faire economic liberalism and free market capitalism, which constituted a paradigm shift away from the post-war Keynesian consensus that had lasted from 1945 to 1980."
## 
## Not that vague Nate 
##      @LuckyHeronSay @LibDems The LibD`s,Tories &amp; anybody who supports capitalism &amp; the cuts etc.HAS to lie.The MSM backs them up. 
## Only full blooded socialism,the truth--&amp; an IMPLEMENTED socialist alt. will suffice 2 beat them.
## Blairites have IMPLEMENTED Tory cuts etc.
## Lab could be in trouble in their areas.
##      @PHPatriot_1 @zoothorn69 The connection between #Christianity, #capitalism, our #Constitution, and the basis of law and order in western civilization isn't taught due to a false notion of separation of #church and state, &amp; the necessity to demonize it to advance #progressivism.

We can (should/must) run some diagnostics. There are two qualities we are looking for in our model: semantic coherence and exclusivity. Exclusivity is based on the FREX labeling metric. Semantic coherence is a criterion developed by Mimno et al. (2011), and it is maximized when the most probable words in a given topic frequently co-occur. Mimno et al. (2011) show that the metric correlates well with human judgment of topic quality. Yet it is fairly easy to obtain high semantic coherence, so it is important to read it in tandem with exclusivity. Let’s see how exclusive the words in each topic are:

dotchart(exclusivity(cap_TM), labels = 1:10)

We can also see the semantic coherence of our topics (the words a topic generates should co-occur often in the same document):

dotchart(semanticCoherence(cap_TM,dfm_toks_sub), labels = 1:10)

We can also see the overall quality of our topic model:

topicQuality(cap_TM,dfm_toks_sub)
##  [1] -153.0452 -112.0533 -156.6526 -152.9175 -144.6928 -148.9168 -149.8053
##  [8] -222.4208 -187.0373 -155.3585
##  [1] 9.617573 9.436403 9.562088 9.482883 9.343743 9.494762 9.505163 9.616828
##  [9] 9.345646 9.572110

On their own, neither metric is very informative (what do those numbers even mean?). They become useful when we are looking for the “optimal” number of topics.

cap_TM_0 <- manyTopics(dfm_toks_sub,
                       prevalence = ~ left_right,
                       K = c(10,15,20), runs = 2,
                       max.em.its = 50,
                       init.type = "Spectral")
# It takes around 250 iterations for each model to converge. I limit the number
# of iterations for time/space, but you should allow the models to converge.

We can now compare the performance of each model based on its semantic coherence and exclusivity:

k_10 <- cap_TM_0$out[[1]] # k_10 is an stm object which can be explored and used like any other topic model. 
k_15 <- cap_TM_0$out[[2]]
k_20 <- cap_TM_0$out[[3]]

# I will just graph the 'quality' of each model:
topicQuality(k_10,dfm_toks_sub)
##  [1] -156.5871 -110.0929 -156.5661 -169.9799 -158.6913 -158.8270 -151.5906
##  [8] -222.2478 -179.9702 -180.8846
##  [1] 9.635785 9.398349 9.591756 9.424861 9.312281 9.591102 9.473077 9.528538
##  [9] 9.387953 9.653311

topicQuality(k_15,dfm_toks_sub)
##  [1] -176.3001 -117.4834 -184.9537 -158.2778 -169.1092 -216.5443 -166.8171
##  [8] -188.7352 -235.7356 -141.5557 -200.8848 -176.1947 -154.4283 -170.0446
## [15] -183.5428
##  [1] 9.752277 9.752145 9.806760 9.624878 9.615049 9.353169 9.652194 9.655456
##  [9] 9.689886 9.686355 9.597133 9.759681 9.661525 9.349216 9.976027

topicQuality(k_20,dfm_toks_sub)
##  [1] -155.8810 -165.9696 -207.8772 -142.5555 -106.7070 -157.1547 -153.8948
##  [8] -169.3887 -155.9115 -213.4163 -169.2706 -200.3757 -175.9227 -168.2318
## [15] -171.0044 -249.4002 -159.4595 -190.0719 -183.0467 -256.6249
##  [1] 9.751484 9.536017 9.698744 9.651605 9.721448 9.893554 9.533039 9.723660
##  [9] 9.769429 9.422858 9.648541 9.854435 9.718662 9.553076 9.872522 9.895744
## [17] 9.851317 9.818935 9.816059 9.865039

Maybe we have some theory about the difference in topic prevalence across sides (or across parties). We can see the topic proportions in our topic model object:

head(cap_TM$theta)
##            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## [1,] 0.07542875 0.02024515 0.49208606 0.07776768 0.09624133 0.06542261
## [2,] 0.15249491 0.18153916 0.08170145 0.03862463 0.30355151 0.05864056
## [3,] 0.16350298 0.13006020 0.07387521 0.07534854 0.12696537 0.10668717
## [4,] 0.31965533 0.03596207 0.14965279 0.06928192 0.09203642 0.09834901
## [5,] 0.02665616 0.01267243 0.03409487 0.50179732 0.01986142 0.23534687
## [6,] 0.02036968 0.01553662 0.01370014 0.04234740 0.01344656 0.02801444
##            [,7]        [,8]       [,9]      [,10]
## [1,] 0.06764590 0.006605026 0.05305056 0.04550695
## [2,] 0.09278333 0.007145652 0.03322783 0.05029096
## [3,] 0.16127230 0.009931679 0.06931750 0.08303905
## [4,] 0.08599592 0.009958856 0.05526251 0.08384517
## [5,] 0.05048596 0.015786213 0.01677402 0.08652474
## [6,] 0.03254694 0.797484935 0.01129259 0.02526068

What about connecting this information to our dfm to see whether there are differences in the proportion with which each side addresses topic 2?

cap_prev <- data.frame(topic2 = cap_TM$theta[,2], docvars(dfm_toks_sub))
lmer_topic2 <- lmer(topic2 ~ (1 | left_right), data = cap_prev)
dotplot(ranef(lmer_topic2, condVar = TRUE))
## $left_right

Makes sense: anti-capitalists cluster together and away from capitalists. We can do something similar with stm’s estimateEffect() function. We just need to specify the functional form and add the document variables.

cap_topics <- estimateEffect(c(2,4,7) ~ left_right_dum, cap_TM, docvars(dfm_toks_sub)) # You can compare other topics by changing c(2,4,7).
plot(cap_topics, "left_right_dum", method = "difference",
     cov.value1 = "Anti-Capitalism", 
     cov.value2 = "Capitalism",
     labeltype = "custom",
     xlab = "More Liberal ... More Conservative",
     custom.labels = c('E. Warren', 'M. Ruffalo','Climate'),
     model = cap_TM)

10 Analyzing a corpus: Scaling models

So far, we have explored tools that provide information about the text. We can also use the text to obtain information about the authors. The Wordfish model developed by Slapin and Proksch (2008), for example, positions the authors of documents on an (ideological) scale. How? The model assumes that the frequency with which politician \(i\) uses word \(k\) is drawn from a Poisson distribution:

\[ w_{ik} \sim \text{Poisson}(\lambda_{ik}) \] \[ \lambda_{ik} = \exp(\alpha_i + \psi_k + \beta_k \times \theta_i) \]

with latent parameters:

  • \(\alpha_i\): a fixed effect for author \(i\)’s overall wordiness
  • \(\psi_k\): a fixed effect for word \(k\)’s overall frequency
  • \(\beta_k\): a weight capturing how strongly word \(k\) discriminates between positions
  • \(\theta_i\): author \(i\)’s latent position

The parameters of interest are the \(\theta\)’s, the positions of the authors (in Slapin and Proksch’s application, party positions in each election year), and the \(\beta\)’s, because they allow us to analyze which words differentiate between positions.
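To fix ideas, here is the rate \(\lambda_{ik}\) computed for a single author-word pair; the parameter values below are made up purely for illustration:

```r
# Wordfish Poisson rate for one author-word pair (illustrative values only)
alpha_i <- 0.2    # author i's verbosity (document fixed effect)
psi_k   <- -1.5   # word k's overall popularity (word fixed effect)
beta_k  <- 0.8    # word k's discrimination weight
theta_i <- -0.6   # author i's latent left-right position

lambda_ik <- exp(alpha_i + psi_k + beta_k * theta_i)
lambda_ik  # expected count of word k in author i's text; about 0.17 with these values
```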

The main assumption is that, indeed, \(\lambda_{ik}\) is generated by the parameters previously described. Let’s believe for a second that the peer-review system works and use the textmodel_wordfish() function to estimate the positions of our authorities in our corpus.

## I will subset my data since as it is there are too many documents:

data_capitalism <- data_capitalism %>%
  group_by(nameauth) %>%
  mutate(count_auth = n())

data_capitalism_sub <- data_capitalism[data_capitalism$count_auth>8,]

## I will also concatenate all the tweets by author:
data_capitalism_sub <- data_capitalism_sub %>%
  group_by(nameauth) %>%
  mutate(text_conca = paste0(text_clean, collapse = " "))

## And finally drop dups:
data_capitalism_sub <- data_capitalism_sub[!duplicated(data_capitalism_sub$nameauth),]

cap_corp_sub <- corpus(data_capitalism_sub$text_conca,
                       docvars = data.frame(author = data_capitalism_sub$nameauth,
                                            left_right = data_capitalism_sub$mem_name))

## Again my dfm
cap_dmf_sub <- dfm(cap_corp_sub,
                   remove_punct = TRUE,
                   remove_numbers = TRUE,
                   remove = stopwords("english"),
                   stem = TRUE)

cap_wfish <- textmodel_wordfish(cap_dmf_sub, dir = c(1,5)) # The choice of anchor documents (dir = c()) should not really matter, other than for the sign of the scale.

summary(cap_wfish)
## 
## Call:
## textmodel_wordfish.dfm(x = cap_dmf_sub, dir = c(1, 5))
## 
## Estimated Document Positions:
##           theta      se
## text1   0.27208 0.06189
## text2  -0.99178 0.04885
## text3  -0.77402 0.04343
## text4   0.37626 0.06202
## text5   0.53266 0.05240
## text6   0.54913 0.06721
## text7  -1.70036 0.01783
## text8  -1.04105 0.05488
## text9   1.18058 0.04790
## text10  0.46719 0.10257
## text11 -0.99258 0.06303
## text12  0.92237 0.07797
## text13  0.07457 0.07749
## text14 -0.72464 0.07788
## text15 -0.70685 0.07443
## text16 -0.45435 0.11727
## text17  0.39764 0.09232
## text18  0.26899 0.08314
## text19 -1.07466 0.06112
## text20 -0.10482 0.24448
## text21  1.61314 0.03896
## text22  0.15266 0.05828
## text23  0.30705 0.11761
## text24  1.34803 0.03473
## text25  1.36736 0.05965
## text26  2.14261 0.02236
## text27 -1.08596 0.04948
## text28 -0.67270 0.04066
## text29 -0.86851 0.06740
## text30  1.00146 0.04266
## text31 -1.78151 0.01650
## 
## Estimated Feature Scores:
##        trump conserv republican democrat    amp  liber    link  common    bond
## beta  0.1831  0.3257    -0.1235   0.3558 0.7277 -1.428  0.8157 -0.1832  0.3446
## psi  -1.2239 -1.7671    -3.0482  -1.7695 0.6283 -2.368 -3.1269 -2.1448 -2.3282
##        capit  imperi  coloni   white supremaci patriarchi jingoism   allow
## beta 0.09024 -0.9719 -0.4575  0.4393   -0.1455    -0.1455   0.5496 -0.1235
## psi  2.40763 -2.0284 -2.7410 -1.6457   -3.0527    -3.0527  -3.7443 -3.0482
##      maintain   relat  luxuri lifestyl  expens  global    major planet   thus
## beta  -0.2717 -0.3965  0.1206    1.413 -0.1662  0.5085  0.01487  1.195  0.626
## psi   -3.0839 -2.0252 -2.6122   -2.523 -2.3640 -0.8460 -0.94706 -2.051 -2.376
##         call  realli personifi virtual
## beta  0.5985  0.3681   -0.3714 -0.3714
## psi  -1.3573 -0.9440   -3.1151 -3.1151

This is an interesting exercise, since the reasoning behind Wordfish is similar to the one behind network analysis. In network analysis, rather than looking at the text, we look at the connections made through Tweets and ReTweets. Let’s see how the two approaches compare:

cap_preds <- predict(cap_wfish, interval = "confidence")
cap_pos <- data.frame(docvars(cap_corp_sub), 
                       cap_preds$fit) %>%
  arrange(fit)

 
ggplot(cap_pos, aes(x = fit, y = 1:nrow(cap_pos), xmin = lwr, xmax = upr,color = left_right)) +
   geom_point() +
   geom_errorbarh(height = 0) +
   scale_y_continuous(labels = cap_pos$author, breaks = 1:nrow(cap_pos)) +
   labs(x = "Position", y = "User") +
   ggtitle("Estimated Positions")

External validation, beibi! (?)

We can also turn the scaling around and see where each word is positioned on the same left-right scale as the authors. Here is the “Eiffel Tower” of scaled words (Slapin and Proksch 2008):

wscores <- data.frame(word = cap_wfish$features,
                        score = cap_wfish$beta,
                        offset = cap_wfish$psi)

wscores <- wscores[wscores$score<5,]

testwords <- c("oligarch", "ecosystem", "undemocrat","plastic",
                 "nation", "trump", "patriarchi")

testscores <- wscores %>%
    filter(word %in% testwords) %>%
    arrange(score)

ggplot(wscores, aes(score, offset, label = word)) +
    geom_point(color = "grey", alpha = 0.2) +
    geom_text_repel(data = testscores, col = "black") +
    geom_point(data = testscores) +
    labs(x = "Word score (Left - Right)", y = "Offset") +
    ggtitle("Estimated position of words",
            subtitle = "Note: The offset parameter is proportional to the frequency of each word.")

One important limitation of Wordfish is that it assumes that all documents address the same topic, which is not necessarily the case. But there are scaling models for every taste (for example this and this), so this should not be too much of a problem. Still, it is incumbent upon the researcher to choose the model that best fits the data and the research question.

11 Analyzing a corpus: Natural Language Processing (NLP)

Working with natural language is not a solved problem. Language is messy and ever-evolving. It takes us the better part of our childhood to learn it: “it is hard for the scientist who attempts to model the relevant phenomena, and it is hard for the engineer who attempts to build systems that deal with natural language input or output” (Kornai 2008).

“Statistical NLP aims to do statistical inference for the field of natural language. Statistical inference in general consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inference about this distribution.” (Manning and Schutze 1999)

NLP has been used for a while in software: word predictors in your phone, spell-checkers, spam filtering, etc. Its implementation in political science is still limited, but the possibilities are vast. NLP makes it possible to move beyond simply establishing connections to investigating the state of relationships, for example by moving from ‘whom’ to ‘who did what to whom’ (Welbers et al. 2017).

We will be testing three advanced NLP techniques: lemmatization, part-of-speech (POS) tagging, and dependency parsing.

11.0.1 Lemmatization

Much like stemming, but rather than cutting off word endings, lemmatization uses a dictionary to replace terms with their lemmas. It is more accurate at normalizing different verb forms (e.g. “gave” and “give”), which is a desirable quality when pre-processing a corpus.
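As a small sketch with spaCyR (assuming spaCy has already been initialized with spacy_initialize(), as it is for the parsing calls below), the sentence here is made up for illustration:

```r
# Lemmatization with spacyr: spacy_parse() returns a lemma column by default
lem <- spacy_parse("She gave him two books", lemma = TRUE, pos = FALSE, entity = FALSE)
lem$lemma  # should contain "give" and "book" rather than "gave" and "books"
```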

11.0.2 Part-of-speech tagging

Syntactic categories for words, such as nouns, verbs, articles, and adjectives.

From Welbers et al. 2017: “This information can be used to focus an analysis on certain types of grammar categories, for example, using nouns and proper names to measure similar events in news items (Welbers, et al., 2016), or using adjectives to focus on subjective language (De Smedt and Daelemans, 2012).”

cap_parsed <- spacy_parse(data_capitalism$text_clean)
head(cap_parsed,25) 

11.0.3 Dependency parsing

Dependency parsing provides the syntactic relations between tokens. For example, “Kendrick” is related to “Lamar”, thus recognizing “Kendrick Lamar” as a single entity. This can be particularly useful if you are searching for certain types of entities (e.g. people, locations, institutions) in a corpus. We can see the differences in the entities mentioned by capitalists and anti-capitalists:

cap_entities <- entity_extract(cap_parsed)
head(cap_entities,20)
# Similar to extract but converts  multi-word entities into single “tokens”:
# cap_entities <- entity_consolidate(cap_parsed_dep) 

cap_persons <- cap_entities$entity[cap_entities$entity_type == "PERSON"]
cap_persons <- unique(cap_persons)
head(cap_persons,40)
##  [1] "Ariana_Grande_Breaks"   "Bernie_Sanders"         "Bernie"                
##  [4] "Arundhati_Roy"          "Elites_Lost_Their_Grip" "\n\n_Dem"              
##  [7] "\n\n_Capitalism"        "Ilhan_Omar"             "Free_Speech"           
## [10] "Mark_Ruffalo"           "Ariana_Grande"          "John_Legend"           
## [13] "Chrissy_Teigen"         "Elizabeth_Warren"       "Harry_Haywood"         
## [16] "Angela_Davis"           "Mark"                   "We_Need_'_Revolution_'"
## [19] "Einstein"               "Wayne"                  "Nonstop"               
## [22] "Deval_Patrick_'s"       "Hitler"                 "\n\n_Me"               
## [25] "\n\n_Marx"              "Danny"                  "Greed"                 
## [28] "Deval_Patrick_’s"       "Jobs"                   "Colin"                 
## [31] "Nigel_Farage"           "Lamar"                  "Pete"                  
## [34] "Rhodes_Scholar"         "Obama"                  "\n\n_Spreading"        
## [37] "Black"                  "Hillary_Clinton"        "John_Cornyn"           
## [40] "Lech_Walesa"

In other contexts, spaCyR can recognize which subject is doing the action and which subject is on the receiving end. Van Atteveldt et al. (2017) use this to analyze who is attacking whom in news about the Gaza war. We can see something similar at work here. Let’s parse the corpus again, this time with dependency relations:

cap_parsed_dep <- spacy_parse(data_capitalism$text_clean, dependency = TRUE, entity = TRUE, lemma = FALSE, tag = TRUE)
head(cap_parsed_dep,25) 

spaCyR can also detect other attributes of tokens in a text:

cap_parsed_att <- spacy_parse(data_capitalism$text, 
            additional_attributes = c("like_num", "like_url"),
            lemma = FALSE, pos = FALSE, entity = FALSE)
head(cap_parsed_att,30)

12 Other techniques not covered

12.1 Word networks

How are words and authors connected? We can create networks with authors as nodes and edges based on the overlap in words between authors. Let’s use the textnets package created by Chris Bail:

# library(devtools)
# install_github("cbail/textnets")
library(textnets)
## Loading required package: udpipe
## Warning: package 'udpipe' was built under R version 3.5.2
## Loading required package: ggraph
## Warning: package 'ggraph' was built under R version 3.5.2
## Loading required package: networkD3
## Warning: replacing previous import 'dplyr::union' by 'igraph::union' when
## loading 'textnets'
## Warning: replacing previous import 'dplyr::as_data_frame' by
## 'igraph::as_data_frame' when loading 'textnets'
## Warning: replacing previous import 'dplyr::groups' by 'igraph::groups' when
## loading 'textnets'
# I will subset my data but lower the threshold to get more authorities:

data_capitalism_sub <- data_capitalism[data_capitalism$count_auth>4,]

## I will also concatenate all the tweets by author:
data_capitalism_sub <- data_capitalism_sub %>%
  group_by(nameauth) %>%
  mutate(text_conca = paste0(text_clean, collapse = " "))

## And finally drop dups:
data_capitalism_sub <- data_capitalism_sub[!duplicated(data_capitalism_sub$nameauth),]

# Remove the search term itself, since it appears in (almost) every document:
data_capitalism_sub$text_conca_nocap <- str_remove_all(data_capitalism_sub$text_conca,"[Cc]apitalism") 

# We prep our data, keeping only nouns (this takes a while):
prepped_cap <- PrepText(data_capitalism_sub, groupvar = "nameauth",
                        textvar = "text_conca_nocap", node_type = "groups",
                        tokenizer = "words", pos = "nouns",
                        remove_stop_words = TRUE, compound_nouns = TRUE)
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.4/master/inst/udpipe-ud-2.4-190531/english-ewt-ud-2.4-190531.udpipe to /Users/sebastian/Box Sync/UH/Text Analysis Workshop/Twits/english-ewt-ud-2.4-190531.udpipe
## Visit https://github.com/jwijffels/udpipe.models.ud.2.4 for model license details

To create the adjacency matrix for the network, we use the CreateTextnet() function. The cells of the adjacency matrix are the transposed crossproduct of the term frequency-inverse document frequency (TF-IDF) scores for the terms that overlap between two documents (see Bail 2016).

cap_text_network <- CreateTextnet(prepped_cap)
VisTextNet(cap_text_network, label_degree_cut = 0)
## Using `stress` as default layout

A mess… We can extract the communities and compare them with the communities we got from the original network analysis:

cap_communities <- TextCommunities(cap_text_network)

cap_communities_full <- cbind.data.frame(cap_communities, data_capitalism_sub$to_membership)
colnames(cap_communities_full)[3] <- "membership_net"
cap_communities_full$modularity_class <- as.numeric(cap_communities_full$modularity_class)

ggplot(cap_communities_full, aes(x=membership_net, y=modularity_class)) +
  geom_point() +
  geom_jitter(width = 0.2, height = 0.2)

We would expect a bit more exclusivity of membership, but oh well…

12.2 Cosine-similarity

How similar are two texts? If we think of each text as a vector in space, we can think of the angle between the two vectors as their “distance”: the smaller the angle \(\theta\), the closer and more similar the texts. If we have a vector measure of each text (e.g. TF-IDF), we can compute the cosine of the angle between them. How? Math. We solve the dot-product identity for \(\cos \theta\):

\[ \vec{a} \cdot \vec{b} = \|\vec{a}\|\|\vec{b}\| \cos \theta \] \[ \cos \theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|} \]

That is the cosine similarity formula. Cosine similarity generates a metric that says how related two documents are by looking at the angle between them instead of their magnitude:

Cosine similarity

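As a quick sketch in base R, the formula above can be computed directly; the two vectors here are made-up stand-ins for the TF-IDF vectors of two documents:

```r
# Cosine similarity from the dot-product formula
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

a <- c(2, 0, 1)   # e.g. the TF-IDF vector of document A
b <- c(1, 1, 1)   # e.g. the TF-IDF vector of document B
cosine_sim(a, b)  # = 3 / sqrt(15), about 0.77
```

For a whole corpus, quanteda offers the same computation over all pairs of documents via textstat_simil() with method = "cosine" (in recent versions this function lives in the companion quanteda.textstats package).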

12.3 Word embeddings

Word embeddings rely on an intuition similar to cosine similarity. By using vector representations of text, we can estimate how alike two words are, rather than treating words as independent units (as dictionaries do).

Sentiment word embedding dimensions

We can train our models to predict how words are related. This requires a previously annotated training data set, some technical skills, and pre-processing of the data. The literature suggests that word embeddings can produce better results than comparable “bag-of-words” approaches. They are particularly attractive because these models can be run in most languages, helping us overcome the limitations of canned, English-focused dictionaries.

13 Goodbye Notes